Abstract

Survival analysis is a collection of statistical methods for which the response variable of interest is the time until an event occurs. In this context, time can be measured in days, weeks, months or years from the beginning of follow-up of an individual until an event occurs, or as the age of an individual when the event occurs. The event can be death, disease, remission, recovery or any other experience of interest that may occur to an individual. More detailed information can be found in Kleinbaum and in Marubini and Valsecchi.

Here we developed an easy-to-use, up-to-date, comprehensive and interactive web-based tool for survival analysis. This tool includes analysis procedures for life tables, Kaplan-Meier, Cox regression, regularized Cox regression and random survival forests. Each procedure includes the following features:

Life table: descriptive statistics, life table, median life time, hazard ratios and comparison tests including Log-rank, Gehan-Breslow, Tarone-Ware, Peto-Peto, Modified Peto-Peto and Fleming-Harrington.

Kaplan-Meier: descriptive statistics, survival table, mean and median life time, hazard ratios, comparison tests including Log-rank, Gehan-Breslow, Tarone-Ware, Peto-Peto, Modified Peto-Peto and Fleming-Harrington, and interactive plots such as Kaplan-Meier curves and hazard plots.

Cox regression: coefficient estimates, hazard ratios, goodness of fit test, analysis of deviance, save predictions, save residuals, save Martingale residuals, save Schoenfeld residuals, save dfBetas, proportional hazard assumption test, and interactive plots including Schoenfeld residual plot and Log-Minus-Log plot.

Regularized Cox regression: variable selection and coefficient estimation using ridge, elastic net and lasso penalties.

Random survival forests: individual survival and cumulative hazard predictions using random survival forests, and interactive plots including survival (with OOB), hazard (with OOB), error rate vs. number of trees, and Cox regression vs. random survival forest model.

1. Data upload

This tool requires a dataset in *.txt format, in which values are separated by a comma, semicolon, space or tab delimiter. The first row of the dataset must contain the header. When an appropriate file is uploaded, the dataset appears immediately on the main page of the tool. Alternatively, users can load one of the example datasets provided within the tool to test it and to understand how it operates.
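
The file format described above can be illustrated with a short sketch. This is not the tool's own upload code: the function names `guess_delimiter` and `read_dataset` are hypothetical, and the delimiter is guessed simply by counting candidate characters in the header row.

```python
import csv
import io

# Delimiters accepted by the tool according to the text above.
CANDIDATE_DELIMITERS = [",", ";", "\t", " "]


def guess_delimiter(header_line):
    """Pick the candidate delimiter that occurs most often in the header row."""
    counts = {d: header_line.count(d) for d in CANDIDATE_DELIMITERS}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("no known delimiter found in the header row")
    return best


def read_dataset(text):
    """Return (header, rows) from delimited text whose first row is the header."""
    lines = text.splitlines()
    delimiter = guess_delimiter(lines[0])
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    rows = [row for row in reader if row]  # skip empty lines
    return header, rows


# Hypothetical example content of a semicolon-separated *.txt file.
sample = "time;status;group\n5;1;A\n8;0;B\n"
header, rows = read_dataset(sample)
```

Counting characters in the header is a deliberately simple heuristic; it would be misled by variable names that themselves contain spaces or commas.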

2. Analysis Methods

2.1. Kaplan-Meier

Concept

Kaplan-Meier is a non-parametric statistical method that is used to estimate survival probabilities and hazard ratios for a cohort study group. In clinical trials, it is often used to measure the fraction of patients living for a certain period of time after a treatment.

Variables

  • Survival time: Time until an event occurs (e.g. days, weeks, months, years)
  • Status variable: The event (e.g. death, disease, remission, recovery)
  • Category value for status variable: Category value of the event of interest (e.g. 1, yes)
  • Factor variable: A categorical variable which indicates different study groups (e.g. treatment, gender)

Usage

A Kaplan-Meier analysis can be conducted by applying the following steps:

  1. Select Kaplan-Meier as the analysis method from the Analysis tab.
  2. Select suitable variables for the analysis, such as survival time, status variable, category value for status variable and factor variable, if one exists.
  3. In advanced options, one can change the confidence interval type (log, log-log or plain), the variance estimation method (Greenwood or Tsiatis), the comparison test type (Log-rank, Gehan-Breslow, Tarone-Ware, Peto-Peto, Modified Peto-Peto or Fleming-Harrington), the confidence level and the reference category (first or last).
  4. Click the Run button to run the analysis.

Outputs

Desired outputs can be selected by clicking the Outputs checkbox. Available outputs are:

a. Case summary

Summary statistics, such as the number and percentage of observations, events and censored cases, can be obtained.

b. Survival table

A survival table can be created. The first column in the table represents the factor group and the index of the time point (e.g. 1.2 means the second time point in the first factor group; likewise, 2.1 means the first time point in the second group). The second column is the survival time, the third column gives the number of subjects at risk, the fourth column is the number of events, and the fifth column represents the cumulative probability of surviving. The sixth, seventh and eighth columns are the associated standard error, lower limit and upper limit, respectively.
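
The columns described above can be reproduced for a single group with a minimal sketch of the Kaplan-Meier estimator. It assumes Greenwood's variance and the "plain" confidence interval type; the function name `kaplan_meier` and the example data are hypothetical and do not come from the tool.

```python
import math


def kaplan_meier(times, events, z=1.96):
    """Return rows of (time, n_risk, n_event, survival, std_err, lower, upper).

    times  : follow-up time of each subject
    events : 1 if the event was observed, 0 if the subject was censored
    z      : normal quantile for the confidence limits (1.96 for about 95%)
    """
    data = sorted(zip(times, events))
    n_risk = len(data)
    survival = 1.0
    greenwood_sum = 0.0
    table = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_event = 0    # events observed at time t
        n_leaving = 0  # subjects leaving the risk set at t (events + censored)
        while i < len(data) and data[i][0] == t:
            n_event += data[i][1]
            n_leaving += 1
            i += 1
        if n_event > 0:
            survival *= 1.0 - n_event / n_risk
            if n_risk > n_event:
                greenwood_sum += n_event / (n_risk * (n_risk - n_event))
            std_err = survival * math.sqrt(greenwood_sum)
            lower = max(0.0, survival - z * std_err)
            upper = min(1.0, survival + z * std_err)
            table.append((t, n_risk, n_event, survival, std_err, lower, upper))
        n_risk -= n_leaving
    return table


# Hypothetical data: six subjects, two of them censored.
km_table = kaplan_meier([1, 2, 2, 3, 4, 5], [1, 1, 0, 1, 0, 1])
```

Only time points with at least one event produce a row in this sketch; censored observations reduce the number at risk at later time points.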

c. Survival plot

A forest plot can be created for each level of the factor group using the survival probabilities at each end point.

d. Mean and Median life time

Mean and median life time and their associated confidence intervals can be calculated for each level of the factor group.
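
As a rough illustration of how these two summaries relate to an estimated survival curve, the sketch below uses one common convention: the median is the first time at which the survival probability drops to 0.5 or below, and the mean is the area under the step curve up to the last time point (a restricted mean). Whether the tool uses exactly these conventions is an assumption, and the function names and example curve are hypothetical.

```python
def median_life_time(curve):
    """First time at which survival is <= 0.5, or None if it never is.

    curve: list of (time, survival) pairs sorted by time.
    """
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


def restricted_mean_life_time(curve):
    """Area under the survival step function from time 0 to the last time."""
    area = 0.0
    prev_t, prev_s = 0.0, 1.0  # survival is 1 at time 0
    for t, s in curve:
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area


# Hypothetical survival curve as (time, survival probability) pairs.
curve = [(1, 0.9), (3, 0.6), (4, 0.45), (7, 0.2)]
median = median_life_time(curve)
mean = restricted_mean_life_time(curve)
```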

e. Hazard ratio

Hazard ratios and their respective lower and upper limits can be calculated for each factor group at each end point.

f. Hazard plot

A forest plot can be created for each level of factor group using hazard ratios at each end point.

g. Comparison tests

Six different comparison tests can be calculated for testing the differences in survival probability estimates between factor groups.
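
To make the idea of a comparison test concrete, here is a minimal sketch of the unweighted Log-rank test for exactly two groups. It is an illustration only, not the tool's implementation; the function name `logrank_test` and the example data are hypothetical.

```python
import math


def logrank_test(times, events, groups):
    """Two-group log-rank test; returns (chi_square, p_value) with 1 df."""
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two groups")
    first = labels[0]
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    observed = expected = variance = 0.0
    for t in event_times:
        # Numbers at risk just before t, overall and in the first group.
        n = sum(1 for x in times if x >= t)
        n1 = sum(1 for x, g in zip(times, groups) if x >= t and g == first)
        # Events at t, overall and in the first group.
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        d1 = sum(1 for x, e, g in zip(times, events, groups)
                 if x == t and e == 1 and g == first)
        observed += d1
        expected += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0.0:
        raise ValueError("variance is zero; the groups cannot be compared")
    chi_square = (observed - expected) ** 2 / variance
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Hypothetical data: group A fails early, group B fails late.
chi_square, p_value = logrank_test(
    [1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1], ["A", "A", "A", "B", "B", "B"]
)
```

Weighted variants such as Gehan-Breslow and Tarone-Ware follow the same structure but multiply each time point's contribution by a weight (the number at risk and its square root, respectively).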

h. Plots
i. Kaplan-Meier curve

Kaplan-Meier curves can be created. A number of edit options are also available for plots.

j. Hazard plot

A hazard plot can be created. A number of edit options are also available for plots.

k. Log-Minus-Log plot

A Log-Minus-Log plot can be created. A number of edit options are also available for plots.

2.2. Cox Regression

Concept

Cox regression, also known as proportional hazards regression, is a method to investigate the effect of one or multiple factors on the time until an event of interest occurs. In this model, the effect of a unit increase in a factor is multiplicative with respect to the hazard rate.
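
In the usual notation (not given in the text above), the model is h(t | x) = h0(t) exp(b x), so exp(b) is the factor by which the hazard is multiplied for a one-unit increase in x. The sketch below estimates b for a single covariate by maximising the partial likelihood with Newton-Raphson steps, using the full risk set at each event time (Breslow-style handling of ties). It is a simplified illustration under these assumptions, not the tool's fitting routine; all function names and data are hypothetical.

```python
import math


def partial_loglik(beta, times, events, x):
    """Cox partial log-likelihood for one covariate."""
    total = 0.0
    for t_i, e_i, x_i in zip(times, events, x):
        if e_i == 1:
            s0 = sum(math.exp(beta * x_j)
                     for t_j, x_j in zip(times, x) if t_j >= t_i)
            total += beta * x_i - math.log(s0)
    return total


def score_and_information(beta, times, events, x):
    """First derivative and negative second derivative of partial_loglik."""
    score = information = 0.0
    for t_i, e_i, x_i in zip(times, events, x):
        if e_i != 1:
            continue
        s0 = s1 = s2 = 0.0
        for t_j, x_j in zip(times, x):
            if t_j >= t_i:  # subject j is still at risk at time t_i
                w = math.exp(beta * x_j)
                s0 += w
                s1 += w * x_j
                s2 += w * x_j * x_j
        mean = s1 / s0
        score += x_i - mean
        information += s2 / s0 - mean * mean
    return score, information


def fit_cox_single(times, events, x, max_iter=50, tol=1e-10):
    """Return (coefficient, standard error) from Newton-Raphson iterations."""
    beta = 0.0
    for _ in range(max_iter):
        score, information = score_and_information(beta, times, events, x)
        step = score / information
        beta += step
        if abs(step) < tol:
            break
    _, information = score_and_information(beta, times, events, x)
    return beta, 1.0 / math.sqrt(information)


# Hypothetical data: x = 1 marks a group that tends to fail earlier.
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 1, 1, 1, 1]
x = [1, 1, 0, 1, 0, 0, 1, 0]
beta, std_err = fit_cox_single(times, events, x)
hazard_ratio = math.exp(beta)  # multiplicative effect of a unit increase in x
```

A positive coefficient here means that the hazard is higher in the x = 1 group, which matches the earlier failure times in that group.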

Usage

A Cox regression analysis can be conducted by applying the following steps:

  1. Select Cox Regression as the analysis method from the Analysis tab.
  2. Select suitable variables for the analysis, such as survival time, status variable, category value for status variable, and categorical and continuous predictors for the model.
  3. In advanced options, interaction terms, strata terms and time-dependent covariates can be added to the model. If there are multiple records per observation, users can specify this by clicking the Multiple ID checkbox. Furthermore, one can choose the model selection criterion (AIC or p-value), the model selection method (backward, forward or stepwise), the reference category (first or last) and the ties method (Efron, Breslow or exact), and change the confidence level.
  4. Click the Run button to run the analysis.

Outputs

Desired outputs can be selected by clicking the Outputs checkbox. Available outputs are coefficient estimates, hazard ratios, goodness of fit tests, analysis of deviance, predictions, residuals, Martingale residuals, Schoenfeld residuals and DfBetas.

a. Coefficient Estimates

A coefficient estimation table, which includes variable names, coefficient estimates and their associated standard errors, z statistics and p values, can be created.

b. Hazard ratio

A hazard ratio table, which includes variable names, hazard ratios and their associated lower and upper limits, can be created.
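
The link between a coefficient estimate and a row of the hazard ratio table can be sketched as follows, assuming a Wald-type interval computed on the log scale and then exponentiated. The function name `hazard_ratio_interval` and the example values are hypothetical.

```python
import math
from statistics import NormalDist


def hazard_ratio_interval(coef, std_err, level=0.95):
    """Return (hazard_ratio, lower, upper) for a Cox coefficient."""
    # Two-sided normal quantile for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (
        math.exp(coef),
        math.exp(coef - z * std_err),
        math.exp(coef + z * std_err),
    )


# Hypothetical coefficient log(2) with standard error 0.1.
hr, hr_lower, hr_upper = hazard_ratio_interval(math.log(2), 0.1)
```

An interval that excludes 1 corresponds to a coefficient whose Wald test is significant at the matching level.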

c. Hazard plot

A forest plot of the hazard ratios can be created for visual inspection.

d. Goodness of Fit Tests

The fitted Cox regression model can be tested with three tests: likelihood ratio, Wald and score.

e. Analysis of Deviance

A deviance analysis can be conducted for each variable in the fitted model.

f. Predictions

Predictions from the fitted model can be obtained.

g. Residuals

Residuals from the fitted model can be obtained.

h. Martingale Residuals

Martingale residuals from the fitted model can be obtained.

i. Schoenfeld Residuals

Schoenfeld residuals from the fitted model can be obtained.

j. DfBetas

DfBetas residuals from the fitted model can be obtained.

k. Proportional Hazard Test

To check the proportionality assumption of the Cox regression model, a proportional hazards test can be conducted both globally and for each variable in the fitted model.

l. Schoenfeld Plot

Besides a formal test of the proportionality assumption, a Schoenfeld plot can be created to check the assumption visually.

m. Log-Minus-Log Plot

Another useful plot for checking the proportionality assumption is the log-minus-log plot. The lines should be approximately parallel to each other if proportionality holds.
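
Why parallel lines indicate proportionality can be seen from a small numerical sketch. It assumes the transformation log(-log(S(t))); some software plots the negative of this quantity instead. Under proportional hazards with hazard ratio HR, the second group's survival is S0(t) raised to the power HR, so the two transformed curves differ by the constant log(HR). The function name and values are hypothetical.

```python
import math


def log_minus_log(survival_probs):
    """Return log(-log(S)) for each S strictly between 0 and 1, else None."""
    return [
        math.log(-math.log(s)) if 0.0 < s < 1.0 else None
        for s in survival_probs
    ]


# Hypothetical baseline survival probabilities at four time points.
baseline = [0.9, 0.7, 0.5, 0.3]
hazard_ratio = 2.0
# Survival in a second group under proportional hazards.
comparison = [s ** hazard_ratio for s in baseline]

# Vertical gap between the two transformed curves at each time point.
gaps = [
    b - a
    for a, b in zip(log_minus_log(baseline), log_minus_log(comparison))
]
```

Every entry of `gaps` equals log(2) up to rounding, which is what a constant vertical distance between the two curves looks like on the plot.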

2.3. Penalized Cox Regression

Concept

Feature selection is a useful strategy to avoid over-fitting, to obtain more reliable predictive results, and to provide more insight into the underlying causal relationships (Ma and Huang, 2008). In this section, feature selection can be performed using a ridge, elastic net or lasso penalty, especially when there are many more predictors than observations (n << p). More information can be found in Zou and Hastie, 2005, Friedman et al., 2008 and Simon et al., 2011.

Usage

A Penalized Cox regression analysis can be conducted by applying the following steps:

  1. Select Penalized Cox Regression as the analysis method from the Analysis tab.
  2. Select suitable variables for the analysis, such as survival time and status variable.
  3. If all predictors are continuous, one can check the Select All Variables option to include all variables in the dataset in the feature selection process. If some predictors are categorical and others are continuous, uncheck the Select All Variables option and select the categorical and continuous variables separately.
  4. Define the penalty term using the Penalty term slider as follows:

       • Penalty term = 0: ridge penalty
       • 0 < Penalty term < 1: elastic net penalty
       • Penalty term = 1: lasso penalty

  5. Select the number of folds for cross-validation. Note that the number of folds must be greater than 3.
  6. Click the Run button to run the analysis.
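
How a single slider value can move between ridge and lasso can be sketched with the elastic net penalty in the parameterisation used by Friedman et al. and Simon et al. That the Penalty term slider corresponds to the mixing parameter `alpha` below is an assumption, and the function name and example values are hypothetical.

```python
def elastic_net_penalty(coefficients, lam, alpha):
    """lam * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


# Hypothetical coefficient vector and penalty strength.
coefs = [1.0, -2.0, 0.0]
ridge = elastic_net_penalty(coefs, lam=0.5, alpha=0.0)   # squared L2 part only
mixed = elastic_net_penalty(coefs, lam=0.5, alpha=0.5)   # blend of both
lasso = elastic_net_penalty(coefs, lam=0.5, alpha=1.0)   # L1 part only
```

The penalty is subtracted from the partial log-likelihood during fitting; only the L1 part can shrink coefficients exactly to zero.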

Outputs

a) Variables in the model

Variable selection is conducted with the selected penalized method (i.e. ridge, elastic net or lasso), and the results are displayed as a table that includes the selected variables and their associated coefficient estimates.

b) Cross-validation curve

A cross-validation curve can be created to investigate the relationship between partial likelihood deviance and lambda values.

2.4. Random Survival Forests

Concept

Random survival forests (RSF) is an ensemble method for analysing right-censored data, first introduced by Ishwaran et al., 2008. RSF has several advantages over Cox regression: (i) Unlike Cox regression, RSF does not rely on the proportional hazards assumption. (ii) RSF accounts for nonlinear effects and interactions for factor variables.

Usage

A random survival forests analysis can be conducted by applying the following steps:

  1. Select Random Survival Forests as the analysis method from the Analysis tab.
  2. Select suitable variables for the analysis, such as survival time, status variable, category value for status variable, and categorical and continuous predictors for the model.
  3. In advanced options, interaction terms, strata terms and time-dependent covariates can be added to the model. If there are multiple records per observation, users can specify this by clicking the Multiple ID checkbox. From the RSF options, the number of trees, bootstrap method, number of randomly selected variables, minimum number of cases in a terminal node, maximum depth of a tree, splitting rule, number of splits, handling of missing values, number of iterations of the missing data algorithm, proximity of cases, size of bootstrap and type of bootstrap can be adjusted.
  4. Click the Run button to run the analysis.

Outputs

a. Individual Survival Predictions

Survival predictions for each observation can be obtained. In this table, rows represent observations whereas columns represent time endpoints.

b. Individual Survival Predictions OOB

Out of bag (OOB) survival predictions for each observation can be obtained. In this table, rows represent observations whereas columns represent time endpoints.
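
The meaning of "out of bag" can be illustrated with a minimal sketch of one bootstrap draw: cases sampled with replacement are used to grow a tree, and the cases that were never drawn are that tree's OOB cases, for which the tree gives predictions it was not trained on. The function name `bootstrap_split` is hypothetical and the sketch does not reproduce the tool's bootstrap options.

```python
import random


def bootstrap_split(n_cases, seed=0):
    """Return (in_bag, out_of_bag) case indices for one bootstrap sample."""
    rng = random.Random(seed)
    # Sample n_cases indices with replacement; duplicates are expected.
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    out_of_bag = sorted(set(range(n_cases)) - set(in_bag))
    return in_bag, out_of_bag


in_bag, out_of_bag = bootstrap_split(10, seed=42)
```

An OOB prediction for a case is then obtained by averaging only over the trees for which that case was out of bag.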


c. Individual Cumulative Hazard Predictions

Cumulative hazard predictions for each observation can be obtained. In this table, rows represent observations whereas columns represent time endpoints.
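
In the random survival forests of Ishwaran et al., the cumulative hazard in each terminal node is estimated with the Nelson-Aalen estimator and then averaged over trees. The sketch below shows only the Nelson-Aalen step for one set of cases; that the tool follows exactly this construction is an assumption, and the function name and data are hypothetical.

```python
def nelson_aalen(times, events):
    """Return (time, cumulative hazard) pairs at each distinct event time.

    times  : follow-up time of each case
    events : 1 if the event was observed, 0 if the case was censored
    """
    data = sorted(zip(times, events))
    n_risk = len(data)
    cumulative = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_event = 0    # events observed at time t
        n_leaving = 0  # cases leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            n_event += data[i][1]
            n_leaving += 1
            i += 1
        if n_event > 0:
            cumulative += n_event / n_risk
            curve.append((t, cumulative))
        n_risk -= n_leaving
    return curve


# Hypothetical data: three cases, all with an observed event.
chf = nelson_aalen([1, 2, 3], [1, 1, 1])
```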


d. Individual Cumulative Hazard Predictions OOB

Out of bag (OOB) cumulative hazard predictions for each observation can be obtained. In this table, rows represent observations whereas columns represent time endpoints.


e. Error Rate

An error rate table, which shows error rate estimations for each tree, can be obtained.
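
In the random survival forests literature the error rate is commonly defined as one minus Harrell's concordance index. Assuming the tool uses that definition, the quantity can be sketched as below. The function names and data are hypothetical, and a higher risk score is taken to mean an earlier expected event.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C over pairs where the earlier time is an observed event."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored case cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def error_rate(times, events, risk_scores):
    """One minus the concordance index."""
    return 1.0 - concordance_index(times, events, risk_scores)


# Hypothetical data: risk scores that order the cases perfectly.
perfect = error_rate([2, 4, 6, 8], [1, 0, 1, 1], [4, 3, 2, 1])
reversed_order = error_rate([2, 4, 6, 8], [1, 0, 1, 1], [1, 2, 3, 4])
uninformative = error_rate([2, 4, 6, 8], [1, 0, 1, 1], [1, 1, 1, 1])
```

Under this definition an error rate of 0.5 corresponds to predictions that are no better than random ordering.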


f. Variable Importance

A variable importance table as well as an interactive plot, which shows relative importance of variables in fitted model, can be obtained.


g. Random Survival Plot

A survival plot can be drawn for survival predictions from random survival forests model. Each line represents a survival curve for each observation.


h. Survival OOB Plot

A survival plot can be drawn for OOB survival predictions from random survival forests model. Each line represents a survival curve for each observation.


i. Cumulative Hazard Plot

A cumulative hazard plot can be drawn for cumulative hazard predictions from the random survival forests model. Each line represents the cumulative hazard curve of one observation.


j. Cumulative Hazard OOB Plot

A cumulative hazard plot can be drawn for OOB cumulative hazard predictions from the random survival forests model. Each line represents the cumulative hazard curve of one observation.


k. Error Rate Plot

An interactive error rate plot, which shows how the error rate changes as the number of trees increases, can be drawn.


l. Cox vs RSF

A Cox model can be compared to random survival forests model through an interactive plot for visual inspection of both models.